Add ContextBench harness core #120
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cad646d9d9
```ts
};
const rawTrace = {
  executor,
  model: executor === 'claude' ? model : 'fake-executor',
```
Preserve actual executor model in raw trace
Set rawTrace.model to the selected model for all real executors, not just Claude. As written, non-Claude runs (codex, gemini, opencode) are recorded as "fake-executor" while the manifest row stores taskExecution.model from --model, so Phase 42 provenance checks (rawTrace.model === row.taskExecution.model) will fail even when the run is otherwise valid, blocking claim-grade verification for those lanes.
Fixed in 867ac70: raw traces now record `model: executor === 'fake' ? 'fake-executor' : model`, so Codex/Gemini/OpenCode preserve the selected model. The adapter smoke test now asserts `rawTrace.model === row.taskExecution.model` and executor consistency for all three adapters.
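The described fix amounts to a one-line model selector; a minimal sketch (the `Executor` union and the helper name are illustrative, not the PR's actual code):

```typescript
// Hypothetical sketch of the provenance fix discussed above.
type Executor = 'claude' | 'codex' | 'gemini' | 'opencode' | 'fake';

// Only the fake executor records a placeholder; every real executor
// preserves the model selected via --model, so the Phase 42 check
// rawTrace.model === row.taskExecution.model can hold for all lanes.
function recordedModel(executor: Executor, model: string): string {
  return executor === 'fake' ? 'fake-executor' : model;
}
```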
```ts
return {
  laneId: laneCard.laneId,
  proven,
  sourceKind: telemetry?.proofSource ? 'env_override' : 'not_captured',
```
Emit claim-eligible lane isolation source kind
When telemetry evidence is present, this always emits sourceKind: 'env_override', but the Phase 42 gate explicitly rejects env_override as insufficient lane-isolation proof. That means rows with otherwise good observed-tools evidence can never satisfy lane isolation in claim verification, because the producer and verifier disagree on the accepted source kind.
Already addressed in the current branch: lane telemetry preserves sourceKind from the evidence source, including proxy and transcript, instead of collapsing everything to env_override. The Phase 42 gate still rejects env_override as diagnostic-only, so claim-grade rows require explicit proxy/transcript evidence.
Clarifying previous reply: this was already addressed in the branch. Lane telemetry now preserves the evidence source kind from the telemetry payload, including proxy and transcript, instead of treating every telemetry-backed row as env_override. The verifier still rejects env_override for claim-grade lane proof.
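A minimal sketch of that pass-through behaviour (the type and field names are assumed from the review discussion, not taken from the branch):

```typescript
// Hypothetical sketch: preserve the telemetry payload's own source kind
// instead of collapsing every telemetry-backed row to 'env_override'.
type LaneSourceKind = 'proxy' | 'transcript' | 'env_override' | 'not_captured';

interface LaneTelemetry {
  proofSource?: LaneSourceKind;
}

function laneSourceKind(telemetry: LaneTelemetry | undefined): LaneSourceKind {
  // No telemetry evidence at all: nothing was captured for this lane.
  if (!telemetry?.proofSource) return 'not_captured';
  // Pass the evidence source through unchanged; the Phase 42 gate can then
  // accept proxy/transcript while still rejecting env_override as diagnostic-only.
  return telemetry.proofSource;
}
```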
Greptile Summary

This PR adds the non-claim-bearing ContextBench harness: runner, retrieval gate, structured answer parser, scoring module, artifact utilities, evidence gate, trajectory normaliser, and a comprehensive test suite. The two previously-flagged regressions (scoring field mismatches and hardcoded setup/index durations) have been addressed.
Confidence Score: 4/5

Safe to merge as non-claim-bearing infrastructure; the P1 scorer/gate artifact gap must be resolved before claim-bearing runs are attempted. One P1 defect: the TypeScript scorer writes stdout/stderr as inline text but the evidence gate requires stdoutPath/stderrPath file paths, so any real scorer artifact will permanently fail hasOfficialEvaluatorProof. Since claimAllowed is false throughout this PR, the gate is never exercised end-to-end yet, keeping the PR safe to land as infrastructure — but the gap must be closed before the claim path is activated. src/eval/contextbench-scoring.ts: ContextBenchScoreResult must add stdoutPath/stderrPath, and the function must write stdout/stderr to separate log files.
|
| Filename | Overview |
|---|---|
| src/eval/contextbench-scoring.ts | Scorer emits inline stdout/stderr text but evidence gate requires stdoutPath/stderrPath file paths — gate will always emit official_evaluator_missing for any artifact produced by this module. |
| src/eval/contextbench-evidence-gate.ts | Evidence gate logic is thorough and well-structured; all gate checks (official evaluator, lane isolation, setup/index cost, runner provenance, denominator contract) are coherent and correctly gated by evidenceMode. |
| src/eval/contextbench-artifacts.ts | buildManifestRow now accepts caller-provided setupIndex; scoring fields are deliberately hardcoded to non-claim-bearing values for Phase 38 smoke runs, consistent with test assertions. |
| src/eval/contextbench-trajectory.ts | Trajectory normalisation is correct; pred_steps[0].spans and pred_spans share the same object reference, which could be problematic if consumers mutate the trajectory output. |
| tests/contextbench-phase42-evidence-gate.test.ts | Comprehensive gate test coverage; passingArtifacts() constructs stdoutPath/stderrPath manually, masking the gap between the TypeScript scorer's output and the gate's requirements. |
| tests/contextbench-scoring.test.ts | Tests cover scorer return value fields and fallback metadata well, but do not verify that the written score JSON artifact satisfies the evidence gate's stdoutPath/stderrPath requirements. |
| tests/contextbench-runner-contract.test.ts | Runner contract tests cover fixture validation, fake-executor smoke runs, manifest append semantics, and setupIndex propagation cleanly. |
Sequence Diagram
```mermaid
sequenceDiagram
    participant Runner as contextbench-runner.mjs
    participant Scorer as scoreWithOfficialEvaluatorFirst (TS)
    participant Disk as Score Artifact (score.json)
    participant Gate as evaluateContextBenchEvidenceGate
    Runner->>Scorer: run official evaluator
    Scorer->>Disk: writeJson(outputPath, { stdout, stderr, exitCode, ... })
    Note over Disk: stdoutPath/stderrPath absent
    Runner->>Gate: artifactsByRunId[runId].score = parse(score.json)
    Gate->>Gate: hasOfficialEvaluatorProof(row, score, hashes)
    Note over Gate: checks score.stdoutPath → undefined → returns false
    Gate-->>Runner: official_evaluator_missing failure
```
Reviews (2): Last reviewed commit: "fix(test): harden ContextBench schema cl..."
```ts
  missingEvidenceFiles: string[];
  unsupportedClaim: boolean;
  falseReady: boolean;
  reasons: string[];
}

function writeJson(filePath: string, value: unknown): void {
  mkdirSync(path.dirname(filePath), { recursive: true });
  writeFileSync(filePath, `${JSON.stringify(value, null, 2)}\n`, 'utf8');
}

export async function scoreWithOfficialEvaluatorFirst(
  params: OfficialEvaluatorParams
): Promise<ContextBenchScoreResult> {
  const args = [
    '-m',
    'contextbench.evaluate',
    '--gold',
    params.goldPath,
    '--pred',
    params.predictionPath
  ];
  if (params.cachePath) args.push('--cache', params.cachePath);
  args.push('--out', params.outputPath);
  const command = `python ${args.join(' ')}`;
  const result = await params.runner('python', args, params.cwd);
  if (result.status === 0) {
    const score = {
      status: 'completed' as const,
      mode: 'official_evaluator' as const,
      claimBearing: true,
```
ContextBenchScoreResult is incompatible with ContextBenchScoreEvidence
scoreWithOfficialEvaluatorFirst returns (and writes) a score object with exitStatus, but ContextBenchScoreEvidence (consumed by hasOfficialEvaluatorProof in the evidence gate) expects exitCode. Additionally, officialEvaluatorInvoked is absent from ContextBenchScoreResult. Because of these two mismatches, any TypeScript harness that stores this function's return value as the score artifact will cause hasOfficialEvaluatorProof to always return false — permanently blocking the claim gate even for a valid run.
The runner .mjs correctly emits both exitCode and officialEvaluatorInvoked: true inline (lines ~1091–1120), but the TypeScript module diverges silently. The two representations need to be reconciled.
Addressed in the current branch before this latest push: ContextBench scoring now emits the gate-compatible evaluator fields, including exitCode, officialEvaluatorFirst, officialEvaluatorAttempted, officialEvaluatorInvoked, command, outputPath, stdoutPath, and stderrPath. The scorer tests cover claimAllowed false versus true behavior and the metadata contract.
```ts
  return content.split('\n').map((line) => JSON.parse(line) as ContextBenchRunManifestRow);
}

export function buildManifestRow(params: {
  runId: string;
  protocolVersion: string;
  protocolHash: string;
  taskManifestHash: string;
  laneCard: ContextBenchLaneToolCard;
  task: ContextBenchTaskIdentity;
  repeatIndex: number;
  status: ContextBenchTerminalStatus;
  startedAt: string;
  completedAt: string;
  paths: ArtifactPathSet;
  hashes: Record<string, string>;
  executor: ContextBenchExecutor;
  model: string;
  timeoutSeconds: number;
  maxContextTokens: number;
  maxAnswerTokens: number;
}): ContextBenchRunManifestRow {
  return {
```
buildManifestRow hardcodes setupDurationMs/indexDurationMs to 0
hasMeasuredSetupIndex in the evidence gate checks if (evidence.setupStatus === 'completed' && setupDuration <= 0) return false, so any row emitted by this function whose setupStatus is 'completed' will permanently fail the setup_index_cost_missing gate. Additionally, the cross-check row.setupIndex.setupDurationMs !== evidence.setupDurationMs compares the row's hardcoded 0 against the artifact's actual measured value, causing a guaranteed mismatch.
The runner script uses a separate buildManifestRowForArtifacts that receives actual measurements; this TypeScript helper should similarly require measured values rather than silently defaulting to 0.
Addressed in the current branch: buildManifestRow now requires caller-provided setupIndex evidence instead of hardcoding setup/index durations to zero, and the runner contract test asserts setupIndex propagation into manifest rows.
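The gate-side rule this reply refers to can be sketched as follows (the check is quoted from the review text above; the surrounding type is an assumption for illustration):

```typescript
// Hypothetical sketch of the measured setup/index requirement described above.
interface SetupIndexEvidence {
  setupStatus: 'completed' | 'skipped' | 'failed';
  setupDurationMs: number;
  indexDurationMs: number;
}

// Mirrors the described hasMeasuredSetupIndex rule: a setup reported as
// 'completed' with a non-positive duration cannot count as measured, so a
// hardcoded 0 would always fail the setup_index_cost_missing gate.
function hasMeasuredSetupIndex(evidence: SetupIndexEvidence): boolean {
  if (evidence.setupStatus === 'completed' && evidence.setupDurationMs <= 0) {
    return false;
  }
  return true;
}
```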
```ts
  isError?: boolean;
};

const SLOW_WINDOWS_TEST_TIMEOUT_MS = 60000;
```
Acknowledged, but not changing for this PR. The duplicate timeout constant is test-local and non-blocking; extracting a shared helper would add a new file for two call sites while the active merge blockers here are ContextBench evidence/provenance contracts.
…s' into pr/contextbench-harness-core
```ts
export interface ContextBenchScoreResult {
  status: 'completed' | 'judge_failed';
  mode: 'official_evaluator' | 'diagnostic_fallback';
  claimBearing: boolean;
  command: string;
  stdout: string;
  stderr: string;
  exitStatus: number | null;
  exitCode: number | null;
  officialEvaluatorFirst: boolean;
  officialEvaluatorAttempted: boolean;
  officialEvaluatorInvoked: boolean;
  outputPath: string;
  fallbackReason?: string;
}

export interface FactRecallDiagnosticResult {
  missingRequiredFacts: string[];
  missingEvidenceFiles: string[];
  unsupportedClaim: boolean;
  falseReady: boolean;
  reasons: string[];
}

function writeJson(filePath: string, value: unknown): void {
  mkdirSync(path.dirname(filePath), { recursive: true });
  writeFileSync(filePath, `${JSON.stringify(value, null, 2)}\n`, 'utf8');
}

export async function scoreWithOfficialEvaluatorFirst(
  params: OfficialEvaluatorParams
): Promise<ContextBenchScoreResult> {
  const args = [
    '-m',
    'contextbench.evaluate',
    '--gold',
    params.goldPath,
    '--pred',
    params.predictionPath
  ];
  if (params.cachePath) args.push('--cache', params.cachePath);
  args.push('--out', params.outputPath);
  const command = `python ${args.join(' ')}`;
  const result = await params.runner('python', args, params.cwd);
  if (result.status === 0) {
    const score = {
      status: 'completed' as const,
      mode: 'official_evaluator' as const,
      claimBearing: params.claimAllowed === true,
      command,
      stdout: result.stdout,
      stderr: result.stderr,
      exitStatus: result.status,
      exitCode: result.status,
      officialEvaluatorFirst: true,
      officialEvaluatorAttempted: true,
      officialEvaluatorInvoked: true,
      outputPath: params.outputPath
    };
    writeJson(params.outputPath, score);
    return score;
  }

  const score = {
    status: 'judge_failed' as const,
    mode: 'diagnostic_fallback' as const,
    claimBearing: false,
    command,
    stdout: result.stdout,
    stderr: result.stderr,
    exitStatus: result.status,
    exitCode: result.status,
    officialEvaluatorFirst: true,
    officialEvaluatorAttempted: true,
    officialEvaluatorInvoked: true,
    outputPath: params.outputPath,
    fallbackReason: 'official_evaluator_failed'
  };
  writeJson(params.outputPath, score);
  return score;
```
Scorer artifact missing stdoutPath/stderrPath; evidence gate will always reject it
scoreWithOfficialEvaluatorFirst writes stdout and stderr as inline raw text fields in the score JSON artifact. But hasOfficialEvaluatorProof in the evidence gate unconditionally checks all three of these conditions:
```ts
typeof score.stdoutPath === 'string' && score.stdoutPath.length > 0 &&
hasSha256Hash(artifactHashesByPath[score.stdoutPath]) &&
typeof score.stderrPath === 'string' && score.stderrPath.length > 0 &&
hasSha256Hash(artifactHashesByPath[score.stderrPath])
```

Because ContextBenchScoreResult has no stdoutPath/stderrPath fields, the serialised score artifact will always have stdoutPath === undefined, causing hasOfficialEvaluatorProof to return false and permanently emitting an official_evaluator_missing failure, even for a successful, claim-allowed run.
The evidence gate test constructs stdoutPath/stderrPath by hand in passingArtifacts(), so this gap is not caught by the existing scorer tests. The scorer must write stdout/stderr to separate log files and include their paths in the score artifact for the gate contract to close.
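One way to close the gap, sketched under the assumption that the log files live next to the score artifact (the file names and helper are illustrative, not the PR's actual code):

```typescript
import { mkdirSync, writeFileSync } from 'node:fs';
import * as path from 'node:path';

// Hypothetical sketch: persist stdout/stderr to sibling log files and
// return their paths so the score artifact can carry the stdoutPath/
// stderrPath fields the evidence gate checks. The runner would also need
// to hash these files into artifactHashesByPath for hasSha256Hash to pass.
function writeScoreLogs(
  outputPath: string,
  stdout: string,
  stderr: string
): { stdoutPath: string; stderrPath: string } {
  const dir = path.dirname(outputPath);
  mkdirSync(dir, { recursive: true });
  const stdoutPath = path.join(dir, 'score.stdout.log');
  const stderrPath = path.join(dir, 'score.stderr.log');
  writeFileSync(stdoutPath, stdout, 'utf8');
  writeFileSync(stderrPath, stderr, 'utf8');
  return { stdoutPath, stderrPath };
}
```

The scorer would then spread the returned paths into both `score` objects before calling `writeJson`, closing the contract with the gate.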
Summary

Verification

- rtk node scripts/contextbench-runner.mjs --validate-fixtures
- rtk node scripts/contextbench-runner.mjs --validate-lane-setup
- rtk pnpm exec vitest run tests/contextbench-runner-contract.test.ts tests/contextbench-lane-setup.test.ts tests/contextbench-scoring.test.ts tests/contextbench-trajectory.test.ts tests/contextbench-baseline-schema-gate.test.ts tests/contextbench-baseline-snapshot.test.ts tests/contextbench-baseline-runner.test.ts tests/contextbench-phase42-evidence-gate.test.ts tests/contextbench-protocol.test.ts tests/contextbench-task-manifest.test.ts
- rtk pnpm run format:check
- rtk pnpm exec tsc --noEmit
- rtk pnpm run build
- rtk git push -u origin pr/contextbench-harness-core

Claim Posture
claimAllowed, or claim Phase 42/product improvement success.